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TaeeeNTRODUCTION 


The first Reduced Instruction Set Computer (RISC) 
appeared at the end of the 1970's and since then long and 
heated discussions have taken place in the computer archi- 
tecture community. These discussions centered around the 
validity of the claims made by the RISC proponents regarding 
the performance achieved by the proposed machines when 
compared to traditional computers that are referred to as 
Semplex Instruction Set Computers (CISC). 
| Due to a lack of an appropriate method to evaluate the 
performance effects of various architectural features, it is 
eerrrcult to resolve the RISC/CISC controversy. 

The interest in the ideas proposed by this philosophy 
has been growing, and presently many of the major computer 
companies are investing a great deal in this new type of 
computer architecture. 

This thesis tries, first, to define the basic character- 
istics of a Reduced Instruction Set Computer, So that 1t 1s 
possible to focus on the specific architectural features 
peculiar to RISC machines. 

The approach that in the author's opinion has to be 
followed, ie, OLaer |) tO evaluate computer performance, 
together with the author's disagreement on the approach 
taken on several published comparisons between RISC and CISC 
machines, are presented. 

A model for computer performance evaluation is 
suggested. This model is composed of two parts. Ne yea iesus 
part deals with the timing analysis of the computer perftorm- 
ance. The second part sets a criterion to determine the 
efficiency of a given computer control unit when used for a 
given application. Finally in order to evaluate the model, 


an example is given demonstrating the quantification of the 


performance effects of an architectural enhancement to. a 
system architecture. 

The model suggested for computer performance evaluation 
constitutes a departure from the current computer perform- 
ance evaluation methods, because the attention is centered 
on the computer architecture rather than on the measurements 
of throughput, response time and mean job turnaround time 
where the main emphasis of the evaluation process is put on 
the software. 

The model is intended to provide a tool for Conpue-s 
architects to use, so that discussions regarding the 
performance achievements of certain architectural features 


might be quantified and rational conclusions may be reached. 





A, INTRODUCTION 

In recent years a new type of computer architecture has 
received a great deal of attention. 

This new architecture is mainly the result of an effort 
conducted in an academic environment. Profiting from the new 
possibilities that custom VLSI offers, the professors and 
students at the University of California at Berkeley, 
collaborating in several courses in this area, began 
meej,eccts .on building single chip computers. 

Due to limitations of the chip area, available tools and 
the available time for the completion of the project, 
several simplifications to contemporary architectures were 
made. For example, the instruction set was simplified by 
eliminating all instructions that might be called composite 
Meseruccions. [his type of instruction is equivalent, in the 
operation performed, to a sequence of other more elementary 
(atomized) instructions. 

A claim has been made, that the obtainable performance 
of these machines was unexpectablly remarkable and this 
triggered a major discussion on the subject of the merits of 
RISC's. 

Feeding the controversy is undoubtly the lack of an 
appropriate method or tool to measure computer architecture 
performance andthe effects of a particular architecture 
modification on the computer performance. 

From the very beginning the RISC machines were related 


to implementation issues in the use of VISI technology. 


Proponents called the approach "RISC", for Reduced 
Instruction Set Computers, as opposed to the traditional 
Computers which they referred to as "C1Isc's", fou seomp lex 


Piserucelom. oot Computers. 


ce 


The "new architecture" proponents didn't present it asa 
proposal to enhance, in some way, the prevailing architec- 
ture, but as a complete departure from the previous work. 

No precise definition has ever been given for the 
complete characteristics of a RISC machine, and because of 
that, there are now in existence several different machines 
all claiming to be RISC's. Although there are some common 
features there is no clear cut agreement on what comprises a 
reduced instruction set computer. 

No doubt some very valid ideas were brought to the 
computer architecture environment by the "RISC philosophy 
proponents", but, nevertheless, it constitutes a sure risk 
to accept a new idea without an open, substantiative debate 
where the benefits are separated from the jargon. 

The first step in understanding and identifying the RISC 
trade-off is a more precise definition of RISC. 

As stated above, several implementations of RISC's are 
already in existence, and, of these, four have undoubtly 
enough importance to be mentioned. 

They are: 


1 bie wi SC mee aac le developed at the Universae 7 ae= 
California at Berkeley 


2) The 801 Minicomputer, developed at the IBM Thomas 5S. 
Watson Research Center 


3) The MIPS, developed at Stanford University. 
In order to develop a definition of the "RISC" the 
existing "RISCs" should be studied. 


Be. SHERI sce aie 

The RISC I and II were both developed at the University 
of California at Berkeley where the acronym RISC originated. 
Since both were developed at U. C: Berkeley, they are very 
Similar in there eomposie1on- ln: face, RISC II 1s no more 
than an enhanced version of RISC I. 

Both are Single Yehip “VES Ssocescors having the 


following charaGeera sre >. 


1) They are 32-bit machines... That is, all registers and 


busses are 32 bit wide. 


Poe lnsteruction Set 


Bh) 


3a) Total number of on-chip registers 
Rise 


+ ) 


>) 


Za) ee I has Sh instrueelons 


RISG 11 Has. 39 instructions 


2b) Both have -a load/store architecture. This means 


that all instructions except load and store are 
register-to-register. Load and store are the only 
memory-reference instructions. 


2c) All_instructions except LOAD and STORE are single- 


cycle where a cycle is the time it takes to read 
and add two registers, and then store the result 
back into a regaster. 


2d) All instructions are the same size (32 bits). 


There are two different formats but the fields are 
at fixed locations. 


ze). Addressing Modes: 


There are two addressing modes; one for register- 
Eee) scorers eLrUCtlONe--negrister Direct and the 
other for memory reference instructions--Index + 
Displacement. 


Registers 


RISC II e-- 198 


3b) The processor is organized in multiple overlapping 


windows in order to facilitate parameter passing 
between procedures. 

The windows are organized in a circular buffer 
fashion. In the case that the nested procedure 
depth is greater than the number of windows minus 
one, the values in the window corresponding to the 
oldest procedure are stored in poy and this 
Vieoerno coc mree =O be wallocarted to the current 
procedure. ewe eel Gmc Gis tel> abe visible 
constituting what 1s called the "current window’. 
All windows have a fixed size and the composition 
shown in Figure 2 

The peo ose registers are common to all procedures, 
and therefore ee are Used to store global vari- 
ees es oRegiccer KO holds a fixed value of zero. 
HicmroweLeqistems are common to the current proce- 
dure and to the called procedure, although, in the 


called procedure, they will have a different 
number since there they constitute the high regis- 
ters of the corresponding window. The high regis- 


ters are common to the current procedure and to 
the ee eS ae iiowaiein ana Low beqisters 
gon wit he global registers constitute the 
over aoe part of each window and are used for 
Parane cer ee ode Be CWeen mp LOceCcures. INotey Ble ell 
registers are only visible in the current window. 


Picmecormerol Unt ais hardwired with most of 1ts logic 
implemented using PLA's. 


Pipeline Stages | 

the RitsGel has two pipeline stages, i.e., depending 
Seeciemereg@anm sequence it Can prefetch the next 
iis Gate Gi On while ays executes the present 


io 


6) 


8) 


el SG om ae 


HIGH REGISTERS 


R28 


Q27 

LocAlL RE GIsTeRs 
| 222 
Q2\ 


LowW REGISTERS 
R18 


GLOBAL QEaisTEes 


R# 


Rise I 
HIGH REGISTERS 


LOCAL Meco 
Ric 
Low REGISTERS 
%\o 


QLOBAL REGISTERS 
Ro 





sc EE Eee ee 


ELGure = 2a RISC Register Window. 


oe ew The RISC II has three pipeline stages, 

espendiud on the pregram Sequence lt eeca 
prereten he next instruction and store the final 
results of the previous instruction in a register, 
while it executes the present instruction. 


Use of Delayed Branch 
In order to increase speed and not to discard the 


Prefetch tnsctruce1on, when a branch instruction is 
executed, the Braneh takes place senuy satter eae 
execution of the next sequential INSCructiem 


See the compiler arranges for the instructitem 
rs ee the branch to be part of the loop, see 
ice Q 


Implementation 

RIS I is implemented with 4 micron NMOS VLSI 
technology with a clock of 8 MHZ “and "ay cycle of ove 
NSEC. RISC El is implemented with So mieren NMOS VLSI 
ee cane oom with a clock Of 42, MHZ and a®cyeleroracco 


14 


Pe Oui oem eana ff have no floating-point support. 


fee toe SOL MINICOMPUTER 

Developed by IBM at the Yorktown Heigths Research Center 
from 1975 until 1979, it was the first machine to follow 
what later would be called "The RISC Approach to Computer 
Architecture". 

Due to its proprietary nature, not much is known about 
it, but some of the ideas present in its design are known 
and have been, in a certain way, the basis for the develop- 
ment of RISC I and II at Berkeley and MIPS at Stanford. 

As opposed to the RISCs and the MIPS, the 801 is not a 
Single chip processor but a minicomputer. 

The general approach is the basis for the design of an 
IBM NMOS VLSI single chip processor known as ROMP or 802. 

The 801 machine is basically a 32 bit architecture with 
Single-cycle four byte instructions and 32 registers. It has 
separate data and instruction cache memories. As in RISC I 
and II, the 801 also has a delayed branch scheme, that is 
the branch only takes place after the execution of the next 
iesecruction. 

The 801 system is said to be compiler<=based meaning that 
a greater demand is made on the compiler. | 

The 801 architecture was defined by George Radin in his 
article 'The 801 Minicomputer' [Ref. 2] as the set of run 
time operations which: 

1) Could not be moved to compile time 

2) Could not be more efficiently executed by object code 

ee tee ogee oe Se Be 

3) Was to be implemented in random logic more effec- 

eee the equivalent sequence of software 

Both data and address busses are 32 bit wide. The 
addressing modes are few: 

- base+index 


- base+displacement 


t= 


= register direce. 
Also a highly-effective optimizing compiler was devel- 
oped for the system. 


De DHE Mies 

The MIPS computer was developed at Stanford University 
by John Hennessy and his students. Its acronym stands for 
Microprocessor without Interlocked Pipe Stages. 

There are strong similarities with the RISC project at 
Berkeley. It has, however, some conceptual differences that 
have already been identified by its proponents in Ref. 3 as: 

1) more complex user level instruction set. 

11) the main design goal is high performance of the 
hardware employe and not Simplicity of (eae 
INSTCEUC Mon voc rE: 

iii) much more complex compiler. 

Specifically its characteristics are the following: 

1) 32 Siseeinae nage: 

24) InStrticetonesee 

Za) So JnserUeelLons 

2D) bOad/ StOre=archteee cute 


Ze. PP Pea Oe except LOAD and STORE are single- 
cycle 


2d) Instructions may be 16 or 32 bit long. 1 An epee 
ae compiler reorders the instructions Soma. 
all lo bit instructions always come in pairs. 


Ze) BOCs ate Modes 
- immediate 

Base ew ven ekl "© Easier 

indexed 

base shift 


3). Reais Cem . . 
There are sixteen 32-bit general purpose registers, 


4) Hardwired control with most of its logic implemented 
using PLA’ s 


5) Use of Delayed Branch instructions 

6) Five pipeline stages 

7) No condition codes 

8) Word-addressable machine 

9) Separate data and instructions memory 


10) No support ter floating-point veoe-reetoms 
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11) Implemented with 4 micron NMOS VLSI technology with 
a clock rate of 8 MHZ. 


E. TOWARD A DEFINITION OF A RISC MACHINE 

Four machines have been described as examples of a new 
type of computer architecture defined as the RISC architec- 
ture, as opposed to the traditional architecture now 
referred to as CISC architecture. 

Any definition of this architecture will have to encom- 
pass the characteristics common to the four previous 
examples. 

To summarize, a RISC Machine will have the following 
characteristics: 


1) Simple instruction set where the great majority of 
the instructions are single-cycle, 


Zyeiead7vstore architecture, that is all instructions are 
register-to-register with the LOAD and STORE being 
the only memory-reference instructions, 

3) Very few addressing modes, 

4) Hardwired Control l.e. no microcode, 


Mminctructioms with one or two sizes and with fields at 
Pe oumrecatlilons , 


6) Some degree of pipelining, 


7) Demand on the compiler to increase performance. 


alee 


Iii. MY APPROACH TO COMPUTER PEREORMANCE EVALUATION 


A, INTRODUCTION 

This thesis has been motivated by the rise of the new, 
RISC computer architecture trend, described in the previous 
chapter,.-and by the claims made by RISC proponents regarding 
the inherent superior performance of RISC when compared to 
traditional architectures. 

Unfortunately, the claims made for these structures were 
not supported by any quantitative arguments. No specific 
attention was given to the effects of various factors intro- 
duced in the RISC architecture and to the influence that 
each factor had on the system performance. 

Computer performance evaluation is different depending 
on the aspects of performance being evaluated. From the 
view point of a potential computer system buyer, there is a 
need to identify features in the system which will enhance 
the performance for a particular application. From the 
viewpoint of a computer architect, performance analysis is a 
way to evaluate specific enhancements from which trends in 


computer architecture design may follow. 


B. EVALUATION AND MEASUREMENTS 

In order to perform an evaluation of any kind, one must 
take measurements of the system under different conditions. 
One wants to take the measurements properly, or else the 
evaluation will be unvalid. 

In order to guarantee that the evaluation will be based 
upon correct data, one has to know: 

1) What the measurements are for 

The buyer is not worried about any of the architectural 
details of the machine, but rather about the throughput of a 


system programmed in a high-level language. 
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im contrast, EMCucembDUuLen area tect must be concerned 
with the internal characteristics and the behavior of the 
system, even when he is testing a system using programs 
written in high-level languages. 

Considering the RISC family of machines the correct 
point of view is undoubtly the latter one. 

Z) What is measured 

Typically one wants to test how each enhancement to the 
computer architecture affects the system performance. in 
order to get a realistic comparison of features, only one 
feature at a time may differ. If more than one feature is 
different, it is difficult to measure the individual effect 
of each architectural feature on the system performance. 

3) How is the evaluation performed 

Because it is not feasible to build a new system each 
time one of the architectural features is altered, a model 
is required. 

Poca! s  eMroughn Ene se ~or 9a model Chat the 


performance effects of any architectural feature will be 


determined, this model has to be able to quantify, ina 
precise manner, the effects of any change in the 
eeenitecture. 


4) For which application are the measurements valid 

The application for which the system is being used has 
an effect on the system performance. No system will show the 
Same performance in two different environments. For example, 
in one application the user might be doing only word- 
processing, and, Thee thew second, the system might be 
floating-point intensive. 

There are, nevertheless, systems that present a balanced 
performance throughout a diversified number of applications. 
They are the so called "General Purpose Computers". But even 
for these, the performance fluctuates, indicating that 
general purpose computers have a better performance for some 


Soeerecattons than tor others. 
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Due to these reasons, the system performance evaluation 
must pay attention to the rigorous definition of the appli- 
cation for which the system performance is being evaluated. 

This requirement for a precise definition of the appli- 
cation, will clarify the validity of the conclusions. 

5) Which factors interact with the measurements 

In the second question, the need to make just one change 
at a time when making the evaluation is emphasized, other- 
wise it would be impossible to determine the individual 
effect of an enhancement on the system performance. 

Specifically if the evaluator has already made measure- 
ments for several changes in the architecture and has also 
quantified the effect of each of those changes on the system 
performance, it is possible to compare two systems, that 
differ by all those changes plus an extra one, not yet 
considered. As a result of the analysis, the effect of this 


last change on the system performance can be quantified. 


C. THE Ree ClseC-CoOnTROVER ow 

Because the problem being discussed is related to 
computer architecture, there is a need for a concise state- 
ment defining Computer Architecture as it is commonly 
understood. 

The adopted definition is the IEEE standard 729-1983 
stating Computer Architecture as: 

"The process of defining a collection of hardware and 
software components and their interfaces to establish a 
framework for the development of a computer system. " 

In the published papers on RISC, several comparisons of 
CISC and RISC examples were made. 

The way these comparisons were done did not give any 
insight, to the answers to the questions presented in the 
previous section, or other similar questions. 

The result is that now, no one Knows for example, if the 


performance of the RISC II is due primarily to Tessreqistes 
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organization scheme, as some claim, or to the simplicity of 
its instruction set, as others do. 
specifically, 
1) If one wants to evaluate the effects of reducing the 
Pleo sos MnO wimomtenplek a CloC machine eé. ¢. 


the VAX-11 and consider the improvements due to all 
the instructions whose execution is equivalent toa 


seGucieccmon Simpler IMeeEeuctitons. For each of these 
more complex instructions one could determine if the 
execution is faster than the equivalent sequence. er 
Chat is not the case, EC mo snocricelOmmsioulad be 
Guiscanced. iMeaneetiniemovement 15 Seen, then consider 
eae cost of adding the instruction to the instruction 
Sec. 


2) If one wants to evaluate the effects of reducing the 
number of addressing modes, one should consider: 


- Why are they needed ? 
- With which data types are they used ? 
- What its the benefit brought by its addition. 


3) If one wants to evaluate the effects of overlapped 


register windows, one should test implementation of 
overlapped windows on several systems and measure, as 
a cost/benefit ratio, the effect of overlapped 


windows on the system performance. 


4) One cannot change more than one feature at a time and 
oe eeonsTet an dea otha the effect of each 
feature is on the system performance. 


>) If one wants to do an evaluation using Pog eats 
Wiekecen in a high-level Fue ae - one should state 
mitermas aA limiting factor. Mgccucimheromemc Omp1 Lers 
enerate differen code, some compilers are better 
Nan others and therefore make different contribu- 
tions to the sp Gaead performance. Furthermore; in the 
case of compiler generated code, the frequency of 
Snoewmlonwot Cach INSemucEiOom in the system instruc- 
E1om set will be different for different high-level 
BRR es a: Besides, two different systems with 
distinct instruction sets do not necessarily have the 
same best compiler. ; 


6) If one wants to make some conclusive statement about 
the advantages and disadvantages of the RISC archi- 
Meeuite, One MUSt Separate the effects of features 
that are orthogonal to the RISC philosophy. 

finemeeract 1s that in the papers published on RISC’'s, 

almost all the comparisons’ made, involved systems with 
different instruction sets, different addressing modes anda 
different number of registers and registers organization 
schemes. Furthermore compiler generated code was’ used 
without considering the performance effects. These are the 
reasons why no one can say whether the RISC architecture is 


Oe ils not better by itself. 
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In this” sSitwaieiwen- while the RISC proponents are 
bringing some jargon to the architectural environment, those 
against RISC are losing track of the possible benefits 
present in the RISC philosophy. 


D. AN EXAMPLE 
As an example, let us pick a common CISC processor, the 
MC68000 and consider its addressing modes. 


The MCo8000 has six basic types of addressing modes, 


namely: 
1) REGISTER DIRECT - The effective address is the 
register designation field in the instruction. 
EA = Rn 


2) ABSOLUTE - The effective address is that given aan 
instruction field itself and it is used diver 
Wi CNOUL Mod Ist carom 
EA = INSTRUCTION FIELD 


3) REGISTER INDIRECT - The effective address is the 
contents of the designated register 


EA = ( Rn ) 


4) IMMEDIATE =- The operand is part of the instruction 
itself and no further addressing is needed 


9) PROGRAM COUNTER RELATIVE = The effective address is 
computed by taking the value in the program counter 
register and adding or subtracting an offset value 

EA = 92 Bayo h ee 
or 
EA =" PCe— GBH SET 


6) IMPLIED - The operand is ina register designated by 
the mnemonic of the instruction. 


The uses of each addressing mode depends on the 
programmer. 

Until now, the philosophy present in the design process 
was to give the maximum versatility possible to the 
programmer, so that he or she could choose the address mode 
better suited to his or her needs. The rise of the kites 
architecture brings some questions regarding the correctness 


©f this eawlosoru 
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In order to answer these questions, there is a need to 
have a correct method for the evaluation of a system 
performance. Together with the evaluation method there are 
some points that have to be considered when deciding how 
many addressing modes to include in the system instruction 
set and how long each addressing mode should be. 

The considerations are to: 

1) reduce the storage requirements per program 

2) reduce the number of bits that must be moved between 

processor and memory to execute a program, es 
reduce the bandwidth requirements on the bus 


Bye reduce the average length of an instructiony i.e., 
reduce the required width of the instruction bus. 


There is a trade-off between the number of instructions 
needed for the system to execute a program and the average 
Past ruction size. 

The decision regarding the number of addressing modes to 
include is also very much dependent on the application, oon 
the data types, on the operations involved, on the use of 
nested procedures, and how the parameter passing operation 
is accomplished between procedures. 

Although not considered here, the addressing problem is 
also very much related to schemes of memory protection where 
one wants to forbid the regular user program from accessing 
some part of memory. 

Besides how each one of the addressing modes is used, it 
is also important to consider the frequency with which each 
addressing mode is used. 

Not much material is available regarding the usage of 
addressing modes. As an example, consider again the 
addressing modes of the MC68000. 

Pe eeclolisky DIRECT 

Since the operand is, ihevenas case, in a register, no 
memory accesses are involved. This provides some speed 
advantages when used for operating on frequently-accessed 


Variables. For infrequently-accessed variables it would not 


be used because the number of registers available on-chip is 
usually very small. 

2) ABSOLUTE 

A memory access cycle is involved in absolute 
addressing, because the operand is in memory. For this 
reason it 1s not as fast as the previous mode. 

Absolute addressing does not have much versatility 


because the instruction address field is constant and the 


operand must reference a fixed location in memory. 
Nevertheless, it is simple. Because no alteration on the 
address field of the instruction is performed, absolute 


addressing 1s an efficient mode to use when the operand is 


within the range of the instruction. 


3) REGISTERS INDIRECE 

In the register indirect mode, one register access plus 
one memory access cycle are involved because the register 
holds the operand address and not the operand itself. 

The register indirect approach is used when the address 
of the operand has just been calculated. It provides 
address-range extension, and in fact this extension 
increases with the difference between the size of the 
instruction address field and the size of the specified 
register. 

4) IMMEDIATE 

Immediate addressing is the fastest way of addressing, 
although it is Limited by. the instruct temas No addi- 
tional memory accesses are needed since the operand is 
Within: the insStructrem tese vr, Since programs are not self- 
modifying it is used only for predefined values---constants. 

5) PROGRAM COUNTER RELATIVE 

The major advantage of relative addressing is that it 
allows the generation of position independent code because 


the location referenced is always fixed relative to the 
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feegram counter. The importance of this fact is very much 
dependent on the memory management scheme adopted in the 
system. 

In addition to the regular memory access, an addition or 
subtraction must also be executed. It is used in relative 
jump instructions e.g., to set up loops or to set up parame- 
ters to be passed to a subroutine. 

ee IMPLIED 

Implied addressing is equivalent to the register direct 
addressing. However, implied addressing restricts the 
opcode to the predetermined register specified by the design 


of the opcode and the design of the processor. 


Be OUGGESTED APPROACH 

It is not feasible to build a new system each time a 
Single architectural feature is changed, in order to eval- 
uate its effects on system performance. 

As a result, there is then need for a model. 

This model should be clear, complete, and able to 
reflect the interrelations that exist between the different 
components. The model should also be applicable to any 
computer system, i.e., the model should be general. 

The model should reflect the performance effects of any 
computer architectural feature such as: 

* Bus Width 

¢ Addressing Modes 

weer ipelining 

* Instruction Queue 

° Momence 1on Prefetching 

In the method suggested for computer performance evalua- 
tion, a comparison is made between a reference system and 
the same system with some change. The reference system is 
mie comoucer system for which it 2s desired to determine the 
impact of each architectural enhancement. The result of 


MiomcOlparason will then constitute a measure of the 


Za 


performance effects of the particular change. The concep- 
tual view of the system used in the model is illustrated in 


el il ts  ore 


TwstRuchion PERFORMANCE 
ser 





Bicgiae oa Conceptual View. 


Four entities are considered: 


i). The@ApoiGcation a0, Sues Ue will One be valid 
for a certain application, not for any application 


2) The System being considered 
3) tne System Instruction sec 


4) The Performance, as the object of the evaluation 
process. 


The instruction set constitutes the central point clea. 
conceptual view. The application uses a1: The system 
SUDDOLtS 11 c The best match will necessarily give the best 
performance. 

The application is characterized by a set of tasks that 
must be performed. Each task is performed with a different 


frequency. For each task a program must be written, so that 
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one task is mapped into one program. Each one of these 
programs executes in a different time. 

The weight of each task or its representation in the 
application is then the product of the frequency of its 
execution and the corresponding program execution time. 

The effects of the application on the system performance 
are the frequency of execution of each instruction in the 
system instruction set. This together with the average 
execution time of the programs of interest will ultimately 


eo Co obama: tne vapp Lica ction. 


lead to a " typical 

The system supports an instruction set in two ways: one 
by the execution time of each instruction and the other by 
the complexity of the control unit necessary to implement 
the instruction set. 

An instruction set is desired that allows for the 
writing of programs with a minimum execution time, but also 
minimizes the amount of support that has to be given by the 


system. 


Za 


IV: TIMING VANAry Ss 


A INIRODUCT ION 

In this chapter a detailed analysis of the model for 
computer performance evaluation is introduced. As described 
in the previous chapter the model is divided into two parts. 
In the first part, the model considers a timing analysis. In 
this analysis the application determines the dynamic 
frequency of execution of each instruction present in the 
system instruction set and finally the system architectural 
characteristics determine the execution time of each 
instouceiene- 

In the second part of the model, which follows in the 
next chapter, the model considers the relation between the 
application and the control unit necessary to implement the 
system instruction set. From this relation a performance 
figure is obtained. 

Any architectural feature will have consequences both in 
the execution time of each instruction and in the complexity 
Of Ehewcontuo ) . 

As has already been mentioned the first part of the 


model is a timing measure. It will consider the execution 


i" 


time of the specified application's typical program. 

Several factors contribute to the execution time of a 
program and not all of them are part of the computer archi- 
tecture. Some have depend on the implementation of the 
system. 

The implementation is very much related to the tech- 
nology chosen. The technology will determine, for example, 
the maximum clock rate obtainable and the number of computer 
components to be placed on chip. 

wo factors have a great impact on the system perform- 


ance, they are the clock rate and the average memory access 
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time. Also the number of components on chip is an important 
factor, since one of the most time consuming operations is 
to transmit data from one place to another. For example by 
being able to have more registers on chip, one might be able 
to reduce the average operand access time and therefore 
speed up the computer operation. If one considers the 
storage registers as part of the system memory then one can 
see that the average memory access time is reduced. 

In the suggested approach to computer performance evalu- 
ation, the main concern is architectural features and not 
implementation restrictions due to technology limitations. 
The reason for this is that a method to evaluate computer 
performance should be general and therefore be able to 


Survive constant technological change. 


Ee elo «COMPUTER SYSTEM 

Any computer system architecture is made of hardware and 
software tools. In the area of software, an important factor 
is the operating system. 

For the sake of simplicity, and since in fact the oper- 
ating system is also a program that has to be run on the 
system, it can be considered as part of the application in 
the computer performance evaluation process. 

If the operating system is not considered as part of the 
application software there would be a need to track all 
calls to the operating system, measure the time the system 
takes to execute the correspondent subroutines and subtract 
Chis from the program execution time. 

In the hardware, the major components are: 

i) the processor 

1i1) the memory 

iii) the busses 

iv) the I/O interfaces 


Vee Gate Circuits 
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The processor consists of the portions of the computer 
made up of the control unit, the arithmetic logic unit, weed 
general purpose registers and the busses that connect all of 
these. 

The memory consists of all the parts of a computer used 
for either temporary or permanent storage, for instructions 
Gr Lor daear The busses are a collection of signal lines 
With multiple sources and multiple sinks. They provide for 
the intercommunication capability among the other computer 
components. The I/O interfaces are the parts of the 
computer through which the system communicates with the 
outside world. 

In order for the overall system to have a good perform- 
ance, it is desired to balance the average work done by each 
component per unit of time. Since each computer component 
has a different function, the work done by each is different 
from the others. It is this work that has to be character- 
ized, so that an understanding of how to maximize it, is 
possible. 

One requirement is that the idle time for each componeme 
should be as low as possible. For example the processor 
should be in an idle state fora data element stored in 
memory as little as possible. 

1. Memory and I/O Interface 

Both memory and I/O interface can be considered 
together, Since both are communication media. Memory 
performs a communication between two instants in time. T/o 
interfaces perform a communication between the computer 
system and the outside world. 

For both memory and I/O the work 1s characterized gy 
how Long it takes to correctly receive a Unit of informatzes 
from the bus and how long it takes to correctly place the 
Same unit of information on the bus. TALS uniG of interne 
tion will be the same in the case of instructions and data. 


This Unit of Infeormatrone Ls ther one bre. 
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Pougmmocen memory and 1/0, the measure of their 
performance is the number of bits that are received or 
transmited per unit of time. This is in fact no more than a 
meanmawidth in units of bits per second. 

For example, a memory unit with a word size of 
sixteen bits and an access time of two microseconds performs 
the same work as another memory with a word size of thirty 
two bits and an access time of four microseconds. 


HENoRY BAND WIDT = ee see CST /gee ) CHa) 


NETORY access TIME 


2. The Busses 

The function of a bus is to pass information from a 
computer component acting as a source to other components 
acting as’ sinks. the memory and I/O interfaces are also 
communication media that treat data and instructions in the 
same way. 

The nature of these signals has no influence on the 
characterization of the bus work or the efficiency with 
which the bus preforms its work. 

thes bucmwerk is characterizee bay: 


1) the number of active sources ata time, here 
assumed to be one 


11) the number of active sinks 
111) the number of signal lines, i.e., the bus width 
iv) the bus cycle time 

As its function is to be a communication medium, the 


bus work is measured by a bandwidth in units of bits per 


second. 
The particular bus bandwidth will be given by: 
Bus BANDWIDTH — Sa, Wt ( BIT /see ) 
Bcr ( the ) 
where 
SI - is the number of active sinks 
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WI - is the bus width 
BCT - is the bus cycle time 
3. The Processor 

After receiving data and/for instructions from the 
bus, the processor alters this data according to the 
sequence of instructions and then delivers the final results 
back to the bus. 

While the previous computer components treat data 
and instructions in the same manner, this 1s not true for 
the processor case. In this case, instructions specify the 
operations that have to be performed, and the data consti- 
tutes the object on which the operations are performed. 

The structure of the processor, i.e., the specific 
configuration of each element is dependent on the instruc- 
tion set and on the data types involved. The instruction 
set configuration makes requirements on the processor, 
because the instruction set is intimately related to the 
processor control unit and the datapath. 

The data types involved in an application should be 
supported by the processor. If, for example, a lot of array 
manipulation is done, then it 1s to be expected that the 
system considers some paralVel Opera t iemmecd pao. in - 

In addition to the data types, Che instruction see 
1s also dependent on the Vappliecation- Therefore the 


processor structure is also dependent on the application. 


Ce - ThE SAr Pie aT ion 

An application is characterized in the same way indepen- 
dent of the computer system being evaluated. It is charac- 
terized by a certain number of tasks that have to be done. 
Each task 1s executed with a certain frequency. For each 
task and for each system there will correspond a program 
written with that system instruction set. 

The frequency of execution of each task is given by the 


number of times (n), that this task 1s executed in a sample 
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eee tasks. so the frequency of execution of each task is 
MocaiIng more than the probability of this task being in 


execution at any given time. 


ie 


re 
a (4.3 
u ) 


a - is the frequency of execution of task i 
- number of times the task i was executed in a 
big sample 
Nl = total number of tasks that were executed in 
that sample 

For each task there is a corresponding computer program. 
This program will take some time to execute. 

The weight of each task or its representation in the 
application will be given by the product of its execution 
frequency and its program execution time in the system under 
Sieve. . 
ny eae (4.4) 


¢ ( 


Wr - weight of the task i in the particular appli- 
cation and for the system in study 
alas [=mOXECUCION Cime of the correspondent program 
By this it 1s seen that the weight of the task is both 
dependent on the application choice and on the system 
choice. 
A program is a sequence of instructions. Its execution 


time can be divided into smaller pieces where only one 


instruction is executed. In this way the program execution 
time is given by a sum of products. Each element of the sum 
will be referred to a single instruction, and consists of 


Ss 


the product of the instruction execution time and the number 
of times each instruction is executed. 


Therefore each element of the sum will be given by: 


> ~ Ni x a (4.5 ) 


where 
N; - is the number of times that the instruction j 
1s executed for the particular program 
IXxT, - execution time of instruction j 


The program execution time will be given by: 


J 
Ja 


Sj ~ the weight of instruction j in the system 
instruction set and for the particular task 
J - the total number of instructions in the 
system instruction set 
Finally, the weight of the application for the system 


under study will be given by the weighted sum of its tasks. 
50, 


W 


a 


ae 
Z 2 W: (oe 


but since 
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then 


als 
Wa = pe F a af (4.8) 


But 
oF 
ai aa - (4.6) 
van 
and 
Sj = No « ExT. (4.8 
SO, 


D. THE PERFORMANCE 
A comparison is made between the weights that an appli- 
cation has in two different systems. In this chapter, where 
a timing analysis is done, the weight of an application 
involves the execution time of each instruction and the 
dynamic frequency of execution of the same instructions. 
The performance will be given by the ratio of these two 


weights. 


Peal - Wa ( 4.to) 
Na 
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Note that 


is the weight of the particular apolicatien 
for the reference system 
is the weight of the same application for the 
system being considered 
the two systems either have two different 


instruction sets or the time of execution of each instruc- 


tion 1s dlftfrerent. or orn: 


SOs 
Pe \< 
a see 
(sy Ae = 
Therefore 
a a, 
; Ne oT 
2. : 2. See 
? t= | jai 
PF | (4.12) 
Be \< 
> F N. Ixt, 
L=4 Ag 21 
where 
I - is the total number of tasks in the particular 
app Lea tvOn. It is the same as the number of 
programs. 
J =- is the total number of Instructions in ee 
reference system instruction set 
K = is the total number of instructions in ec 


system in study instruction set 
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Considered in this way the measure of the performance 


for a system is better the larger the ratio. 


E. A SPECIAL CASE AND THE RISC 
If the application involves only one task and therefore 


only one program, the performance would be given by, 


Me 


3 Ny Ext) 


Jj? 
\K 
” N, IxT, 


ferk = (4.13) 


& 


Let us now consider the RISC philosophy. For this case 
the value of J is fixed. | 


The RISC proponents advocate that by reducing the total 


Mmuimoecr Of instructions in the instruction set l.e., Diy 
reducing the value of K, the performance of the system 
inceases. They also advocate that the instruction execution 


time for each instruction is reduced by having a simpler, 
more straightforward machine with better performance. 

Their argument is that the value of the denominator is 
reduced because the two previous factors compensate for the 
necessary increase in the number of times each instruction 
is executed. By reducing the denominator the system will 


have a better performance. 


Be lHE SYSTEM ARCHITECTURE AND TIMING 

As has just been seen, the particular choice of applica- 
tion determines the dynamic frequency of execution of each 
Perse nUuctlonmean Che instruction set. mOomeGcOnkinue the study, 
there is now a need to analyze how the system architectural 


Smaracteristices influence the system performance. 


oy, 


The system structure and its instruction set are neces- 
sarily related. For e€Very Insteruceion the system has to 
have the necessary support in terms of the control unit and 
the datapath. Also, any new enhancement to the system 
architecture will affect the execution time of one or more 
instructions. Therefore it will always affect the average 
instruction execution time. 

The model under discussion considers that each instruc- 
tion has acertain associated weight, this weight being 
dependent on the application and on the system architecture. 


The application determines the number of times each instruc- 


tion is executed, i.e., the dynamic frequency of execution 
of the wnseGructioen: The system architecture determines the 
execution time of each instruction. It is this execution 


time that will now be studied. 

We define the Life Cycle of an instruction (LC) as the 
time period beginning at the instante the instruceren eee 
first fetched from memory and ending at the instant the 
final results produced by the operation are stored back in 
memory. 

The instruction execution time will then be some portion 
of its time life cycle. This portion will be dependent on 
the system architectural characteristics such as pipelining, 
parallel processing, instruction prefetching, instruction 
queue, etc. 

The main phases through which an instruction has to pass 
in its life cycle are: 

Lt) Fecehing 
11) SP xe eu om 

The time the system takes to fetch an instruction is 

dependent on the instruction bus width, the instruction 


length and the bus cycle time in the following way: 


Inste Gs 
c = 20? 2 ee x (Bos Crete time) (4.14) 


BOs SOUSTE 
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TitomvValiemwrer tle fetch time will be am average, more 
or less rigorous, depending on: 
Pi fiserUe tO SliZem f£lxed Or variable ) 
11) the availability of the instruction queue 
Not all the instructions have the same structure, but 
nevertheless, all of the instructions accomplish some trans- 
formation on some data. The data might be one or more oper- 
ands andthe final result in the case of an arithmetic 
imeeseruction, or the data might be the contents of the 
‘program counter in the case of a branch. 
In order for the system to be able to accomplish the 
transformation required by the instruction, it has to: 
1) decode the instruction 
2) locate the data ( e.g., addressing modes ) 


3) place the data in a convenient location to be 
ransformed, if it is not there already 


Wea oeeoOmimeeene thansrormacion asked for by the 
PSG On 


5) relocate the data in a convenient location. 

Whether these phases are performed in a sequential 
fashion or in parallel depends on the system architecture. 
For example, suppose that the instructions followed a fixed 
format with separate and predefined fields for OPCODE and 
ADDRESSSING. Then it would be possible to decode the 
instruction and the address field simultaneously. 

In order for the system to process the addressing mode 
and depending on the particular address mode, it may have to 


do one or more of the following: 


- preform data aos tems, e41ther BegioeencO— 
register or memory~-to-register; 

= preform some addition e.g., in the case of base 
addressing, index addressing Sig Disaie hn 
addressing; 

= Cason some multiplication e.g., in the case of 
jew ot hdex mode. 


For the sake of simplicity one could eensider allerche 


data transfers that have to be done while the system 
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executes a program and determine an average time for data 
transfer. 

Typically if the system has on-chip registers, cache 
memory andmain memory, the value for the average data 


transfer time will be: 


+ oa C M 
pr = ma CRAT) + = (CAT) + — (mat (4.15 
it IF i ) 
where 
R =- number of register accesses 
C - number of cache accesses 


M-=- number of main memory accesses 
T = total number of data transfers 
RAT =- register access time 

CAT =- cache access time 

MAT - memory access time 


and 
TR Ce aM | (4.16) 


In summary, in the anstruction lite e€yele one hac. 
EE - fe0ehing time 
TDEC =- decoding time 
TLOC - locating data ( address mode ) 
TDATA - access data 
TOP =- perform the operation 
TW - write the final results 
If the system performs all of these time phases in a 
sequential fashion so that there is no overlap, then the 
instruction time life cycle will “just be the summation joe 


all the time phases: 


LCno = TFE+TDEC+TLOC+TDATA+TOP+TW (no overlap) (4.17) 


If some overlap among the phases is present, then the 
instruction time life cycle will be some portion of the 


previous value (no overlap case). 
LCo = y * LCno (overlap case) (4.18) 


where 
y-~ is a coefficient that measures the efficiency 
of the architectural scheme that accounts for 
the overlap possibility. Its value will be 
always between zero and one. 


Some of the architectural characteristics that might 


influence the value of y are: 


= Beene ce of  COMmMoOnwMeMonlese £0 data and instruc 
Ie Sy 


- instruction format 

- instruction type 

- bus width 

- dual port memories 

The architectural characteristics will also determine 

the amount of overlap execution among different instruc- 
tions. The efficiency of this overlap will then determine 
what portion of the instruction time life cycle value will 


be the instruction execution time (IXT). 
iV La= we bce (4.14) 


where 


IXT = instruction execution time 


a1 


w - efficiency of the overlap among the time life 
cycles of different LMS crc ce LOimer Values 
ranging £rom Zero sto, ome: 

The value of w, that is the amount of overlap will be 

determined by several architectural characteristics such as: 

- pipelining 

- prefetching 

- instruction queue 

- parallel processing 

- instruction length 

- bus width 

- datapath 
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Ve CONPRO@Ge ANALYSIS 


A. INTRODUCTION 

In the previous chapters a timing analysis of the system 
Operation was presented. In it a study was made first of the 
application effects on performance through the dynamic 
Beeoamency of execution of each instruction, and second of 
the system architecture effects on performance through the 
execution time of each instruction. 

Finally to complete the model being suggested, one has 
to consider the requirements that the instruction set poses 
on the system in terms of the required control complexity. 

These requirements will also be dependent on the 
application. 

This is also important since no matter what technology 
1s used in ‘the system implementation, the number of 
resources available on-chip will always be limited. 

Typically the control unit is implemented using either 
microcode or is hardwired e.g., using programmable logic 
arrays. Some of the factors that impact the choice are: 

* instruction set complexity 

me required control unit size 

* possibility of future changes in the instruction set 
* speed 

The size of the control unit (i.e., the number of gates 
needed to implement the control unit) will determine the 
space available on-chip for other components. In the case of 
the RISC I and II the smaller control unit and therefore the 
smaller power consumption, allowed the designers to add more 
registers to the processor chip. With the choice of addi- 
tional hardware for the processor , the designers in fact 
reduce the average memory access time if one considers the 


registers as also part of the system memory. 
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B. THE CONTROL UNIT AS A FINITE STATE MACHINE 

The control unit of a computer system can be viewed as a 
finite state machine, and therefore can be analyzed as such. 
If analyzed in that way, the control unit operation can be 
described by austare mecca: am: In its most simple andemocr 
general case, the state diagram will typically have only two 


states, see Figure 5.1. 


FETCH 


NEXT JNSTRUCTION 





E KECUTE 


Figure 5.1 Simple Control Unit State Diaguan: 


In a more detailed analysis, the control unit state 


diagram will have a tree like format where any vertical path 


will correspond to the execution of an instruction, see 
EUCuir es 2i7¢ 
In this case, each and every instruction is identified 


and each state although, still belonging to one of the two 
major phases fetch and execute, will now correspond to a 
microstep in the control unit output sequence while the 


system is executing a program. 


i4 


Pecove 
TNs TRv cTiowd 





Figure 5.2 More Detailed Control Unit State Diagram. 


Of course this is complicated if the system is able to 
deal with more than one instruction at a time. Nevertheless 
the complexity of the controller can always be associated 


With the number of states. 


CF Dag Colmron UNIT COMPLEXITY 
Not ale the states will count in the same fashion since 
there are states that will be common to more than one 


Mm@oesuctcion Or vertical path. 


The number of these shared states will depend both on 

the processor instruction set itself and on the implementa- 
tion choices made by the processor designer. For example, in 
this last case the processor designer could make useuie. 
microcode subroutines to be shared or called by more than 
one instruction. 
. If states are shared among instructions, then there will 
always be some trade-off between the total number of states 
of the control unit and its speed. This tradeoff is due to 
the fact that when states are shared among different 
instructions, the control unit has to have some feedback 
capability. The specific value of the feedback will force 
the next state of the control unit, when the vertical paths 
corresponding to the instructions will ultimately separate 
themselves. 

No matter what this feedback will be, it will always 
have some cost related to it. dhe ¢ost as the e€xtrayea ee 
takes for the values of the feedback signals to be valid. 
Since the cost is time, it will be reflected in the average 
INnSErUCEITON execuciton cine: and so affect the performance of 
the system in the portion the model described in the 
previous chapter. 

In this part of the model we focus on the comparisons of 
twWO (GOncrol- Uns: 

The complexity of a particular instruction will then be 
dependent both on the number of states it has and on the 
number of states which are Shared by more than one 
Instructi6s:, 

The cost of adding a new instruction to a certaue 
processor instruction set is the number of new states that 
have to be added to the control unit state diagram. The 
addition of this instruction will Nave™ ja CeOce ene teme mie 
performance that can be minimized by maximizing the number 
of states necessary to its execution that are already in 


existence in the control unit state diagram. 
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Rortiapwiincg tO tne CONcrol unit the number of states is 
then dependent on: 
i) the number of instructions 


ii) the number of states that are common to more 
than one vertical path (or instruction) 


iii) the average height of each instruction 
Where the height of one instruction is defined as the 


number of states in its vertical path. 


D. TITHE APPLICATION AND THE CONTROL UNIT 

In the previous chapter the instruction set and the 
dynamic frequency of execution of each instruction together 
with the instruction execution time were considered. Now 
one wants to know how effective, the control unit is for the 
application where the processor is being used. 

It has already been seen that the complexity of the 
control unit is related to the number of states. One knows 
that a smaller and simpler control unit has.an effect on the 
processor performance, because more space would be available 
on-chip for other resources. One choice might be to add new 
registers to the processor chip and thus try to decrease the 
average memory access time. 

One also wants Eom acmmenemmuinioe= “OL Instructions 
that are needed in order to perform a certain task, so one 
has to go back to the application. An application is char- 
acterized by a certain number of tasks that have to be done. 
Each task is performed with a certain frequency. For each 
task a program will have to be written using the instruction 
set available. Each program corresponds to a sequence of 
instructions used to perform the corresponding task. 

Directly from the program it should be possible to 
compute the static frequency of each instruction. But that 
Meme mot the only frequency that is of interest to the 
performance evaluation process. The dynamic frequency of 


execution is more important. 
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The two frequencies will be different for each instruc- 
tion depending Wor; 
1) program sequence 


11) conditional branches and the most frequent values o£ 
- the variables on condition. 


The execution of a program is then a sequence of several 
instructions execution. 

Since a single instruction corresponds to a vertical 
path in the processor control unit state diagram, the execu- 
tion of a program will then be an up and down walk on the 
state diagram. 

When comparing two control units, the one that would 
have to execute fewer instructions, supposing that the 
average height of an instruction would be the same for both 
control units, will be the best. The height of an instruc- 
tion is in fact a measure of what the RISC proponents call 
the instruction complexity. Because it would be natural that 
two different processors have instruction sets with 
different values for the average height of an instruc 
the becveomm line Ses Sehateene comparison of two control Umma 
complexity cannot be done through the counting of instruc- 
tions executed, but through the counting of the number 
states through which each control unit has to pass when the 
system executes a typical application program. | 

It is to be expected that if one wants to add an 
instruction to a processor instruction set, the control unit 
will suffer by an expansion. For a hardwired implementation 
e.g., using PLA's these will have to grow; for a microcode 
implementation typically there will be a need to increase 
the size of the microcode memory. The amount of the control 
unit expansion will be dependent on the implementation, on 
the instructions tise it and on the designer's choice 
regarding the number of states that will be shared Swi 
SX1SEING ust eueerons. There is a relation between the 


number of gates used in order to implement a controller and 


the number of states present on the controller state 
diagram. 

Because there is a direct and individual relation 
between the control unit states and the gates that compose 
the control unit, and because one wishes to use each and 
every one of these gates a similar number of times in order 
bomincrease the overall efficiency, then for better effi- 
Seaway it is desirable that all states are used in a 
balanced way. With some similarity one might say that the 
efficiency of the use of an instruction set increases when 
all the instructions in that instruction set tend to be used 
an equal number of times. 

An application has an indirect relation to the number of 
states through which the control unit has to pass in order 
for the system to execute the corresponding programs. 

fa the Optimum case the control unit will have the 
Hobbowing Characteristics: 

i) minimum number of gates 


11) for the specific application all states will be used 
in a balanced number of times 


1ii) no state exists that will never be used. 


ee LHE MODEL 

Assume that a control unit has a total number of states 
a, Associated with each state there will be a certain 
number of gates. This number will be dependent on the imple- 
mentation choice, either microcode or hardwired Logic. Of 
these T states, an application uses 5 states, and of these 5 
states some states will be used more than others. 

The weight of the application is related to the number 
of states through which the control unit has to pass in 
order to execute the corresponding programs. 

Each state has some weight associated with it. APokes 
weight will be dependent on: 

1) the number of times the state is used 


ii) the number of instructions that share the state 
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aeeeleg) ie ee of gates needed for implementing each 
state. 


The complexity of an instruction will be related to its 
height, that is the number of states in the corresponding 


vertical path in the control unit state diagram. 


So; 
H 
Cj = - W (5.1) 
een 
where 
Ci - complexity of the instruction j 
Ww, - weight of state h 
Moos belght eof Ghee mS rite ctor 
and 


Wp = Go (Size 


G - number of gates per state ( implementation ) 
LU, = number of instructions to which the stare 
common 
The weight of an instruction will be the product of the 
number of times the instruction is executed for a given 


program times the anstruction complexity 


That is 
- = N: Gs 5.3) 
a et (5.3) 
where 
N; - number of times the instruction j is executed 
As in the previous chapter, the weights of the task and 


the application will be: 


WJ. = ce y ie > (S.4)\ 
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where 
Wi - weight of task i 
¥; - frequency of task i for a certain application 
y) - number of instructions in the instruction set 
For an application its weight will be: 


as “as 
= r. a. (5.5) 


or 


Me 


I H 
2 oe Ni e a feng) 


c.° 
ty 


where 


Wa = weight of the application 
ay 


- number of tasks in the application of 
interest 
ee wmoer esone lnsirue tons in the processor 


instruction set 
H - height of each instruction 
Similar to the timing analysis in the previous chapter, 


the performance of the system under study will be given by: 


arf = Wa_ (sa) 
Wa 
where 


Wa - weight of the application for the reference 


system 


— 


weight of the same application for the system 


being considered 


oul 


SOr 


where 


ae 
i 


es 
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Sz 
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M™Ma 
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Ma 
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INSELUCELONS 


system instruction set 


K -= number of 


<1 


INSEPUCtIORS on 


Sicily ens. c alietethOn Set 


herent 


ot 


Piste TO as 


system instruction set 


height of 


INnStructi On tee ene 


study instruction set 


N. - number 
while 


typical applacacionspredeam 


while 


of times 
the 


Tn St riick on 


reference 


~alne program 


Go - number of gates per 


the system under 


system control unit 


SEUCY COncClo Punic 


reference 


diagram 


system 


QZ 


system 


study executes 


state in 


Uy - number of AWnstevicervonseraal 


the Contre! 


(5:8) 


t - number of tasks (programs) in the applicauien 
~J - number 


reference 
the system under 
reference 


system under 


executed 


the 


number of times the instruction k is executed 


the 


the reference 
G' - number of gates per state in the system under 


Share state h in 


state 


U, = number of instructions that share state 1 in 
Enews) scemelnaem = SCUdY COnCrOolL Unit state 


diagram. 
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VI. CASE ANALYSIS 


A. INTRODUCTION 

As an example we will analyze the change in performance 
of a particular application program when some floating point 
capability is added to a processor which currently performs 
fixed point arithmetic. 

In this case study, the performance effects of the 
program code sequence will not be considered. These er eae 
are mostly due to any capability of the processor related 
eo: 

* pipelining 
¢ parallel processing 

Specifically, the case consists in the possible addition 
of a floating point multiply instruction t€0 a procesaen 
instruction set. The processor that was chosen was’ the 
Motorola MC68000. The application for this evaluations 


the computation of a Fast Fourier Transform. 


B. THE ADDITION CF AN INSTRUC 

The addition of an instruction to the original inseeaa 
tion set has several consequences. 

First of all if a hardwired controller is used the 
processor's control unit must be expanded so that the 
instruction is incorporated. The amount of the controljupuee 
expansion is dependent on the number of new states that the 
instruction under consideration will add to the control, unas 
state diagram and also on the control unit implementation. 

Tn -facu; one of the reasons to use microcode in the 
implementation of an instruction set is due to the flexi- 
bility it gives in any future changes of the instruction 


set. 
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second and depending on the operation performed by the 
instruction, some hardware will have to be added to the 
processor. The amount of hardware that will have to be added 
to the processor is dependent both on the hardware that 
already exists on-chip, that the instruction might use and 
is dependent also on how fast one wants the instruction to 
operate. 

The addition of more hardware to the processor will 
cause a rise in the power consumed by the processor. Due to 
a limited power dissipation capability, the net effect of 
the increase in the number of gates that constitute the 
control unit and the datapath will be a reduction in the 
Size of existing processor components or a migration of some 
off-chip, so that the power consumed by the processor stays 
constant. 

Sie cucmec! Mice ebe eco replace “some of “the registers 
available on-chip by the hardware necessary for the new 
mis etruction. By reducing the number of registers on-chip, 
there will be a decrease in the ratio of register accesses 
to the number of main memory accesses. 

In the case of a Load/Store architecture such as the 
RISC architecture, a reduction in the number of registers 
will cause an increase in the dynamic frequency of execution 
Of LOAD and STORE instructions relative to the other 
isc ructions. 

In a traditional architecture, where the LOAD and STORE 
instructions are not the only memory reference instructions, 
the effect of reducing the number of on-chip registers is an 
increase in the average instruction execution time because 
the proportion of memory accesses to register accesses will 
increase. 

This increase in average instruction execution time will 
cause an increase in the typical application's program 


SxectueLlon time. It is this increase in execution time, that 


= 


will have to be overcome by the addition of the new instruc- 
tion to the precessomeenstrucetomesce, so that in fLacheene 
program execution time might suffer a reduction rather than 


an increase. 


C. THE COST/GAIN TRADEOFF 

The floating point multiply instruction after eer. 
added to the processor instruction set, will replace the 
sequence of instructions that the processor had to execute 
every time a multiplication of two floating point numbers 
was called for. 

In order for the addition of the floating point mulage 
instruction to be considered, the instruction has to pass 
several tests. The first test requires the instruction 
execution time to be smaller than the correspondent instruc- 
tion sequence execution time. 

If that is not the case, then there is no pointaanm 
adding the instruction to the processor instruction set. 

50, €OMeicer: 

Ini - execution time of the new instruction 

lseq - execution time of the corresponding sequence of 
InSGhuct. ons 

For the addition of the new instruction to be consid- 


ered: 
lni < lseq (6.1) 


Assume then that in fact the above condition is true, 
then 


lseq = Ini + lgain (6.2) 
or 


ao 


Ini / lseq=c (ae) 


where c < 1 

mor the sake of simplicity, consider that the applica- 
tion of interest is composed of only one task. That is to 
say that the effects on the processor performance will be 
considered only within the context of a program. 

The model suggested for computer performance evaluation 
has two parts, ameimimg analysis and a control unit 
complexity analysis. These two parts of the model will give 
Pesce tO two distinct criteria to which the addition of the 
instruction will have to comply. SOupenat. the gain in the 
processor performance that is obtained, will surpass the 
reduction or cost in the processor performance due to the 
requirements brought by the same instruction to the 
processor architecture. 

1. tainted Cai terion 
The timing model says that the effects of the addi- 
tion of one instruction to the system instruction set, on 


the system performance will be measured by: 


a 
) Nib} 


peas Me ee. (6.4) 
i 
De Ney La. t Naw Ln 
et 
where 
J - is the number of instructions on the original 


system instruction set 


my 


Nj - number of times that the instruction j is 
executed before the addition of the new 
instruction to the processor instruction sec 

Lj - execution time of the same instruction j on 
the original system 

Na; - number of times that the instruction j is 
executed after the addition of the new 
Detshoal ble 1eakepel 

La; - execution time of the instruction after the 
addition of the new instruction 

Nuuwy = number of times the new instruction is 
executed 
Lat = execution time of the new instruction 

The numerator is a measure of the execution time of 
the application program before the addition of the instruc- 
tion under consideration. The denominator is a measure of 
the execution time of the application program after the 
addition of the new instruction. 

The sequence Of ANSEEUG ele ricue a the original 
instruction set that implements the operation performed by 
the new instruction is executed a number of times. This 
number will be equal to Nnew. 

The sequence execution time will consist of the 
execution time of several instructions. 


Therefore 
al 
Lseq 2 2 ee L (6.5) 
a 


where 
Nua. = number of times that the instructicneieoreta- 
Original instruction set is executed during 


the sequence of instructions execution. 


a6 


then 


¥ ae 
ae : j=l lier 

Fy 

>. Sa. Lay + Nrew L new 

jt! 
and 

a Es No: 4 Naas recy (6.7) 

where 


Neo; - number of times the instruction j of the 


original instruction set is executed outside 
the sequence. 


For improvement in performance: 
eee. (6.3) 


This indicates that it 1s worthwhile to add the new 


MaserWetion to the original hHMsteeucelon set for this 
eo Li cation. 


Then, one wants 


2) s 
> No; Li - Le wees e a Lj > - a Lai + N new Laws (64) 
qi ue 


a 


vice 


sy 
l seq = i So) Lj (6.5 ) 


SO 


= J 
l No, Ly + Nr L sax > d. Na, La; + Nae Law (6.10) 


jet {Fs 


i 
N ned ( Lseq - L new ) » £ Siete La, -L al ei (6.11) 


The right term of the inequality corresponds to the 
increase in the application program execution time, that was 


caused by the suppression of some hardware components of the 


processor e.g., some registers. 
This increase, caused by an increase in the number 


of instructions that have to be performed--case of the LOAD 
and STORE instructions ina lLoad/Store architecture, Om 
caused by an increase on the average instruction execution 
time--case of a traditional architecture. 


Therefore 


JS 
La: a es No. Ly = TAMIng CosTS = Vener (6.12) 


i uy 
s 


oe 
tl 

Law 

’ 
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On the left term of equation 6.7, 
Lseq - Lnew 
represents the gain in execution time that was obtained by 
substituting the sequence of original instructions by the 


new instruction, each time the operation was performed. 


Dey 
Lseq - Lnew = Timing Gains = Tgain (6.13) 
Tae ty. 
Nnew Tgain > Tcost (6.14) 
Or 
Nnew > Tcost / Tgain (Gigs) 


Based on an timing analysis, it is only advantageous 


to add the new instruction if: 


ea.) 


1) Lsegq > Lnew 


and 


Z) Nnew > Teost / Tgain (6.17) 
To put it in another way, the addition of an 
instruction to a processor LisenleGrLonue set: Will’ “only 


increase performance if that instruction is executed a 


61 


sufficient number of times during the application progran- 
execution. The exact number of times the instruction must be 


executed is given by the above criterion. 


oe: Control Unie weonplexi ty Cricgemmen 


Concerning the analysis of the <¢entrew Une 


complexity one has: 


a M 
et. 1] 





4 
“fect Ss es eee , (6.\2) 
T H H rey 
De Nia: 7 Se + Narw a Ge 
° N A ey Uy jue} Uy 


glccey 


Since the implementation of the control unit will be 


the same and the implementation determines the value of GO, 


the equation simplifies to, 


J as 
yy 2D 
S L 
= ay A, 
‘Per F = (6.15) 
on H ae 
Ls Na, )- at +p N wea ne 
hee el eek 


As in the timing analysis one wants: 


Perf > 1 (6.20) 


oz 


That is 
(6.21) 


Aves 


H Hee 
Na: ps —_— + N Nt ) = 
Uy, 


Mr 


J * 
_ goalie 


ee Sk Ue gee dee 
As before, the execution of the sequence will 
consist on the execution of several instructions, then 
H 
ed at (6.22) 
Shuey Ly) 


Re! J 4.3} Uy 
Then 
5 a J ce 
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Ls = represents the gain in the number of states, 
obtained each time the operation performed by 
the instruction and/or the sequence is 
executed. 

ES - represents the cost in the number of states 
due to the addition of the new instruction 

Then 


Nnew * Ls > Es (6.25) 
or 


Nnew > Es / Ls (CES 


D. AN ILLUSTRATIVE EXAMPLE 
An example is now presented to clarify the use of the 
model suggested through the present and previous chapters.’ 
The example quantizes the effects of adding a floating 
point multiply instruction to an existing processor insti. 
tion set. | 
As has been previously stated, the values determined for 
the increase or decrease on the system performance will only 
be valid for a given application. 
1. The Processor 
The Motorola MC68000 is selected for this example. 
The MC68000 is awidely known microprocessor that has a 
Simple instruction set offering mo ftloadcingeeeine supoome 
The MC68000 has a 16-bit data bus anda 32-bit 
address bus. In addition to the Program Counter and Status 
Registers, the MC68000 has seventeen 32-bit registers. These 


registers are divided into two groups. - The £ieee. is coup, 
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composed of eight registers are general purpose data regis- 
ters. The second group, composed of the remaining nine 
registers is used mostly for handling addresses. 

In total, there are fourteen addressing modes on the 
MC68000, although they can be studied in six basic types. 
These addressing modes are already described in chapter two 
of this thesis. 

The instruction set of the MC68000 consists of 56 
basic instructions, having from zero to two addresses. Each 
instruction can use several addressing modes. This fact 
determines that the MC68000 does not follow a _ Load/Store 
architecture. 

The instruction set of the MC68000 supports five 
basic types of data: 

* bits 

eS bytes (3S bits) 

e words (16 bits) 

m= Longwords (32 bits) 


: nec cee binary-coded decimal (BCD) with two digits per 
yte 


The input/foutput on the MC68000 is memory-mapped, 


ie .”, all I/O interfaces share the address space with 


- memory. 


Considering the implementation of the MC68000, it is 
a Single-chip VLSI HMOS processor with a typical clock rate 
between 4 and 12 MHZ and with a typical memory access of 4 


clock cycles. 


Oe The Application 


For the application we choose a program that 
computes a Fast Fourier Transform. This program was 
obtained from ' The Fast Fourier Transform' by E. Oran 
Brigham [Ref. 4]. The program is written in Fortran. The 


flowchart of the computation done by this program is on page 
lol of the above reference. The program itself appears on 


page 164 of the same book. 


o> 


From the reading of the program, one can immediately 
verify that some of the operations that are called for could 
not be directly implemented with the MC68000 instruction 
Seu. 

For these operations it was necessary to use either 
subroutines present in ' Microprocessor Systems, a 16=-Bit 
Approach’ by William J. Eccles [Ref. 5] or newly written 
subroutines. The subroutines to handle floating point 
numbers in the MC68000 came from Ref. 5. 

The subroutines that were written are shown on 


appendixes C and D, these subroutines compute the sine and 


the cosine of an angle, according to an algorithm presented 
in the ‘' Software Manual of the Elementary Functions’ by 
William J... “Gody, J.R. and William Waite [ Ref. 6. ae 
125-143]. 


The translated program for the Fast Fourier 
Lransform computation is shown on Appendixes A and B. 
3. The Floating Point Representation 
The floating point representation that was chosen is 
the IEEE proposed standard for single precision. This stan- 
dard determines a 32-bit long representation of a floating 


POint number, shewnein si ¢ouremew 


EXPONENT MAN TISSA 


oO { Se 2 





Figure of 1 Floating Point Representation. 


This standard has the following characteristics: 


1) 32 bits are used 


66 


im) radix of two 


iii) the radix point before the first digit with assumed 
one to the left 


iv) mantissa 

ivsajmsclgn positirven = O 

Ive Dj Value positron = 9=31 

iv.c) representation - normalized, sign/magnitude 
Vv) exponent 

Veaestagn Dosttienas— mo sign 

v.b) value position - 1-8 


ne) eect oe - biased exponent, bias = 
127( dec) 


v.d) range of exponent - -126 to 127 


Wmierange Of floating podmt mumber - += 5.9*10**=39 to 
ey On * SS 


All the subroutines that handle the floating point 
data and that were used obey to this standard, so does the 
hardware necessary to implement the floating point TIE larote e 

4. The Hardware Involved 

The general structure of the hardware required for 
mie inolementation of an additional floating point multiply 
instruction in the MC68000 instruction set was obtained from 
the ‘Introduction to Computer Architecture’ [Ref. 7:p. 80] 
and is shown on Figure 6.2. 

The hardware consists of: 


eae ae ot eds (Cho ene commcG al bem some Of the 
already existing data registers on the. MC68000, 


141i) an 8-bit adder used for the exponent addition, that 
could just be the adder already existing on -the 
MC68000, 

iii) a multiplier used for the mantissa multiplication, 


iv) an exclusive-or gate for the product sign calcula- 
Ia, 


v) a normalizer and converter 
With the hardware structure that was chosen it is 
possible to perform in parallel the determination of the 
Sidn Of the result, the addition of the two exponents, and 


the multiplication of the two mantissas. 


Oy 
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NAATISL 4 


MOT ieLieg 






FLGUGeNGwz General Hardware Structure for the 


Floating Pougnt Multi pi yais it aed Ol. 


The execution time of the floating point multiplica- 
tion instruction will then be determined by the slowest of 
these three distinct and parallel operations. 

The sign computation involves just one exclusive-or 
gate gate and therefore takes a maximum of one clock cycle. 

The addition of the two exponents involves in fact 
the addition of the two exponents, followed by the subtrac- 
tion of the bias since this has also to be performed concur- 
rently with the determination of exponent overflow or 
underflow. 

From [ Ref. 7 | the addition of the contents of two 
registers using the MC68000, takes 4 clock cycles to 
complete. After this addition an extra clock cycle will be 
taken for the determination of exponent overflow and under- 


flow together with the subtraction of the extra bias. 
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Therefore it 1s concluded that the addition of the two expo- 
nents will take a maximum of 5 clock cycles. 


For the mantissas multiplication, a multiplier will 


have to be added to the processor hardware. Recorcalnd co 
"Digital Systems: Hardware Organization and Design by 
Frederick J. Hill and Gerald R. Peterson ' [Ref. 8] the 


multiplier structure that gives the best cost/performance 
tradeoff in terms of the hardware involved and the time it 
takes to perform a multiplication is a multiplier that uses 
a carry-save adder. There a carry save adder type multi- 
plier was chosen. 

Also, according to [Ref. 8:p. 361] the time that a 
carry-save adder takes to perform an N-bit multiplication 
using a adder for which each addition/shift cycle takes two 


eueck cycles is given by: 


Tmult = (N+1)Tc (6.27) 


where 
Lewis mene rhLock7cyele time 

In the case being discussed the multiplication 
involves two operands - ‘the mantissas. Each mantissa is 
24-bits long. Therefore according to the formula shown 
above, the multiplication of the two mantissas will take 25 
Emeck cycles. This makes the the multiplication the longest 
operation involved. 

Note that, the detection of a zero product can be 
geme Concurrently with the multiplication, Since a zero 
product will happen only in the case where one of the oper- 
ands 1s zero. 

The normalization must still be done sequentially. 
The normalization involves at most one left shift of the 


mantissa product anda decrement of the product exponent. 
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There is only at most one shift, since the mantissas of both 


operands are in normalized form and therefore their values 


are between O.5 and l. In the worst case, the two mantissas 
are both 0.1 (binary) and so their product will be @Gmgl 
(binary). In this case only one left shift is necessary in 


order to normalize the mantissa of the product. 

The normalization requirement that the standard 
makes on the mantissa, also dictates that any overflow or 
underflow of the exponent product does not have a possible 
recovery. 

In conclusion, the floating point multiply iInsteme- 
tion with this hardware will take approximately 26 clocks to 
complete. 

The hardware that would have to be added to the 
MC68000 would only consist of the 24 bit carry-save adder, 
the exclusive-or gate and some logic to determine overflow 
or underflow of the exponent and a zero product. 

All this hardware will be more or less equivalent to 
two of the 32-bit registers existing on the McC68000. Say 
then, that due to power dissipation limitations on the 
MC68000 two of the 32-bit data registers would then be 
removed from the MC68000, in order.to add the additional 
hardware necessary to implement the floating point muleiek 
TASteEUG tem. 

5. Dhe Model 

As stated previously, the addition of the instruc- 
tion will have some costs. One of these costs has been 
referred in the previous subsection, it is the removal of 
two of the data registers. 

As one might expect the removal of some of the 
registers from the MC68000 will have an effect on the system 
performance by reducing the number of registers accesses and 


increasing the number of main memory accesses. 


EO 


In the specific case of the application that is 
being considered, this is not true because, at most, Six Of 
the eight data registers are used at one time. Therefore, 
for this specific case, the timing costs involved due to the 
Saeamelon of the floating point multiply instruction will be 
zero. 

For each and every subroutine involved in this 
application, the execution time of the subroutine was 
computed following a worst case anda best case criteria. 
The difference between the two execution time values for 
each subroutine arises due to data dependencies on the 
number of times each instruction is executed. 

The execution times of each subroutine were then 
combined, best with best and worst with worst, in order to 
define two boundary lines for the final execution time of 
the whole program. | 

For the specific case of the floating point multiply 
Subroutine, the smallest execution time corresponds to a 
multiplication of two floating point numbers where one of 
them is zero. The longest execution time for the same 
subroutine corresponds to the multiplication of two numbers 
mere an eXponent underflow occurred after the normalization 
step. Here, for the same reason as before, the normaliza- 
tion requires at most one left shift. 

Specifically, the values obtained for the execution 
times of each subroutine are shown in Table I in terms of 
eLock cycles. | 

For the whole program the execution time will be 
dependent on the values of the data and on the number of 
emeey POolnts (N) to the Fast Fourier Transform computation. 
The values obtained in terms of clock cycles and number of 
required floating point multiplies are shown in Table II. 

The best case and the worst case execution of a 


floating point multiply subroutine takes respectively 203 


oa 


TAB EE 


EXECULION TIME OF BACH SUCRE U EI 
IN FAST FOURIER TRANSFORM PROGRAM 


peo “CASS WORST CASE 


162 
255 
1524 
Toa 
604 
14459+9MULTFP 
2407 Sey 9 MUZE? 





eZ 
180 
IL JAS: 
Ips: 
Ao 
Z0e. BOM ee Pe 
2904+ 3MURIEP 


GCEIRE 
oye as: 
NORM 
ADDEP 
MUG IEE 
SINE 
COSINE 










































AS eee 


FAST FOURIER TRANSFORM 
APPLICATION PROGRAM EXECUTION TIME 


Poet Caog WORST CASE 


572482+352MULTFP 1899074+736MULTEP 
1418194+880MULTEP 4734674+1840MULTEP 
3484658+2112MULTFP 11444210+4416MULTFP 
8198594+4928MULTFP 26770882+10304MULTFP 


18901458+11264MULTFP 6135240272355 2ZM0Lire 

42902562+25344MULTFP 1384171865299 2Z Uwe 

Pole oZZers e527 OMU bia 308440946+117760MULTFP 
2413497 /94+ 123 902MUET RE 680458176425 90/7/2N0LT EE 
469394450+270336MULTFP 1488217106+56524S3MULI EE 





and 640 clock cycles to execute. For a clock rate of 10 MHZ, 
the program execution time before the addition of the new 


TNStruction Wil sper lS ine leble tie 
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oie Lt 


FET PROGRAM EXECUTION TIME BEFORE THE ADDITION 
Crete Fon ONG POINT MULLIPLY INSTRUCTION 


N BEST WORST 
EXECUTION TIME EXECUTION TIME 
(SEC) (SEC) 
16 0.064 0. 234 
Be OMe 0.584 
64 peau ieee 
128 0.920 3. 299 
256 2.119 7.558 
512 4.805 ioae 
1024 10.762 37.957 
2048 23.865 83.694 
4096 52.427 ieee 


For the same clock rate, the program execution time 
meacer che addition of the floating point multiply instruc- 
tion is shown in Table IV. | | 

The best case is the one where the implementation of 
mae floating point multiply offers less gain. 

For the best case 

Tgain = 203 - 26 

For the worst case 

Tgain = 604 - 26 = 578 clock cycles 


As already explained, for both cases Tcost is zero. 


ivf sehock cycles 


Biwaeis Cue to Ehe fact that in the particular application 
program two of the general purpose data registers are never 
used. In the case that all general purpose data registers 
were used in the application program this would not be true. 
If this happened then there would be an increase in the 


ratio of the number of register accesses to the number of 


iS 


TABLE IV 


FET PROGRAM EXECUTION TIME AFTER THE ADDITION 
OF THE FLOATING POINT MUETIP Rar U2 er bel! 


BEST WORST 
EXECUTION TIME EXECUTION 
( SEC) (SEC) 


QO. 
OF 
O. 
OF 
if 
4. 
oe 
al 


tS 
~] 





main memory accesses, causing an increase on the average 
operand access time and an increase on the average instruc- 
ELON CxXeCCuULt on tamer 

Using the formula for the model regarding the timing 
analysis the performance effects of the addition of the 
floating point multiply instruction come as shown in Table 
Mis 

From these results one can see that the improvement 
on the MC68000 performance due to the addition of the 
floating point multiply instruction 2oxr eis secre apes 
cation varies between ten and twenty percent and 1s 
independent of the number of data points to the Fast Fourier 


Transform computation. 
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TABLE V 


PERO WwaNen Keni@io OF THE ADDITION OF THE 
PEOeL NCO r? MOELLER Y INSTRUCTION 


BEST CASE WORST CASE 
Perf Perf 


at i 
a 1 
a He 
i im 
1 les 
1 i 
1 le 
il die 
i ie 
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Veit. CONCEUSTONS 


This thesis began by making an identification and char- 
acterization of anew and controversial type of computer 
architecture called RISC for Reduced Instruction eae 
Computers. The rise of this new computer architecture and 
the discussions that followed regarding 1ts performance, 
when RISC machines are compared with CISC machines, has 
shown the need for an appropriate tool to evaluate computer 
performance from an architectural point of view. 

This thesis suggests amodel to be used by computer 
architects to determine the performance effects of an 
enhancement to a computer architecture. The computer evalu- 
ation process is important, since it generates have a quan- 
tified perception of the influences that each enhancement to 
the system architecture will have on the system performance. 
The availability of a model to do computer performance eval= 
uation 1s therefore essential in the decision-making process 
for determining which architectural features a system should 
have to optimize its performance for a certain application. 

To develop this model for the evaluation of computer 
performance, a conceptual view of what determines the system 
performance was formed. It is the author's opinion that the 
performance of a system results from the quality of the 
match between a particular application requirement and the 
architectural characteristics of the system. This match is 
done through the customization of the system instruction 
set. 

The model that is suqgested is divided into wo parce 
The first part makes a quantification of the effects thav ag 
architectural enhancement to the system has in the execution 
time of a "typical" application program. The Ssecemcdepa: emer 


the model compares the efficiency of the design of two 


He 


systems control units. In both parts the model considers 
that the application determines the number of times’ each 
instruction of the system instruction set is executed. 

BOtmethe first party, the system architecture determines 
the execution time of each instruction. For the second part, 
the system architecture determines the number of states 
through which the system control unit will have to pass 
during the execution of the application program(s). 

Finally, an example on how to use the model, in order to 
determine what are the costs and benefits of adding an 
instruction to a processor instruction set for a particular 
BoplLication, 1s given. 

The program that was used to apply the model is a bit 
misleading in the quantification of the cost/benefit ratio 
of the enhancement. This is due to the fact that in opposi- 


tion to what should be expected, the program does not use 


all the system architectural resources and so, even before 
the addition of the new, instruction does not optimize the 
system performance. If that were not the case and the 


program was an optimal one for the application of interest 
and for the processor chosen, then, surely, the enhancement 
to the system architecture would have some costs. . 

In any event and even considering that the example is a 
bit misleading, the author arrived at two criteria, each one 
derived from one of the parts of the model, for waken the 
BelgltiOn Of an instruction to a system instruction set has 
to obey so that the performance of the system for the 
pemeiecular application is increased. 

These two criteria will be applied if the new instruc- 
tion execution time is smaller than the execution time of 
the sequence of instructions that implemented the function 
before the addition of the new instruction to the system. 

For the first part of the model the criterion for the 


Bea2e1on of the new instruction, states that: 


ae 


Nnew > Tcost / Tgain 


where 

Nnew - is the number of times the new instruction 

1s executed for the particular application 

Tgain - is the difference in the execution times 

of the sequence of instructions that had 

to be executed by the system every time 

the operation was performed before the 

addition of the new instruction and the 

execution time of the new instruction. 

Tcost - is the increase in the application program 
execution time that was caused by the 

suppression of some hardware components of 

the processor 
For the second part of the model, the criterion for the 


addition of the new instruction, states that: 
Nnew > Es / Ls 


where 

Ls =~ represents the gain in the number of control 
unit states, obtained each time the operation 
performed by the the instruction and/or the 
sequence is executed. 

Es - represents the cost in the number of states 
due to the addition of the new instruction to 
the system instruction set. 

The two parts of the model need to be thoroughly checked 
and confirmed with measured values, so that their validity 


is determined. 
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Ser 


LOOP.1 
102 
LOOP2 


LOOP3 


200 


300 


MOVE. W 
ASR. W 
MOVE. W 
SUBI.W 
CLR. W 
MOVE. W 
BEQ. S 
MOVE. W 
BEQ. S 
MOVE. W 
MOVE. W 
BEQ.S 
ASR. W 
SUBI. W- 
BRA 
MOVE. W 
JSR 
MOVE. L 
MOVE. W 


MOVEQ. L 
ASL 
SUBI. L 
BCC 
MOVE. B 
LSR. L 
ROR. L 
ANDI. L 
OR. L 
MOVE. L 


APPENDIX A 


= 


FAST FOURIER TRANSFORM 


N,N2 
NZ 
NU,NU1 
#1,NU1 
K : 
NU, DO 
100 
Nz 
101 
NU iaeZ 
Kk, De 
200 
Hie 
#1,D2 
BQOEr 
Dey 
[Bie 
Rae 
N, D3 


#159,D4 
#1203 
#1,D4 
300 
+9,D5 
Do Ds 
D5Syb4 
mask, D4 
D4,D3 
DS ae N 


Tg 


;N2Z2=N/2 


; NUL=NU-1 


; K=0 

DOM TOO. L=1 , NU 

7 oO mero »i=1, NZ 

; PS eetTRGk Z**NU dy NU ) 


e 
f 


;ARG = 6.283185*P/FLOAT(N) 
MconvemeeN LO £LOat. point 
;clear D4 except exponent 

;D3 <-- FLOAT(N) 

-store FPN 


400 


MOVE. L 


MOVEQ. L 


ASL 
yee 
Bee 
MOVE. B 
pice 
ROR se 
ANDI. L 
ORG 


MOVE. L 


LEA 


LEA 


LEA 
JSR 
MOVE. L 
MOVE. B 
JSR 
LEA 
JSR 
JSR 
LEA 
JSR 
MOVE. L 
JSR 
MOVE. L 
JSR 
MOVE. L 
MOVE. W 
ADDI.W 
MOVE. W 
ADD. W 
MOVE. W 


Pes 
#159,D4 
#1,D3; 
#1,D4 
400 
#9,D5 
DoAD3 
D5,D4 
mask,D4 
D4,D3 
Boye 
FPWR, AZ 


Be Ree AL 


FPP, AO 
GETFP 
#2PI,(Al) 
#2PI1,2(Al) 
MULTFP 
FPN, AO 
CATEE 
DIVFP 

ARG, AO 
STFP 
ARG,X 
COSINE 
RESULTC 
SINE 
RESULTS 
K,K1 

#1. K1 
Iepe 
NZS 
D3,K1N2 
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;cOnvert P to float oor. 


° 
/ 


;clear D4 except exponent 
;D4 <-- FLOAT( P) 

;store FPP 

;A2Z2 points to Floating rors: 
;Working Register 

;Al points to Floating Powe: 
-Accumulator 

; FPWR <-- FPP 

7; PPACCe-—-=e ZF} 

+P PACE Sea = 571 

;FPWR <=-~- FPN 

> PPACE ==" 2PI 7h Eh 

; store ARG 

; C=COS( ARG) 

“SCoOreuwe 

; S=SIN( ARG) 

;store S 

;K1=K+1 

;KIN2Z=K1+N2 


/ 


LEA XREAL, A3 ; TREAL=XREAL( K1N2 ) *Ct+ 
+XIMAG( K1N2)*S 


LEA XIMAG, A4 
ASL. W #1,D3 ;D3 <-- 2*KIN2 
SUBI.W #2,D3 oe ee one 


ADDA. W D3 7As ; 
ADDA. W D3 , A4 . 


MOVEA.L A3,A0 ;FPWR <-- XREAL(K1N2) 
JSR GETFP 

MOVE. L (Aa (Al) ;FPACC <-- FPWR 

MOVE. B PACE Al) 

LEA CHAO FPWR <-- c 

ISR GETEFP 

JSR MULTFP ;FPACC <-- XREAL(K1N2)*C 
LEA TREAL, AO ;store partial result 
JSR Sie ; 

MOVEA.L A4,A0 ;EPWR <-- XIMAG( K1N2) 
JoR GETFP : 

MOVE. L CRO) AGL) ;FPACC <-- FPWR 

MOVE. B Dero 20a) 

LEA S,AO ;FPWR <-- S 

JSR GETEP : 

JSR MULTEP ;FPACC <-- XIMAG(K1N2)*S 
LEA TREAL, AO ;FPWR <-- partial TREAL 
JSR _ GETEP ; 

JSR ADDFP ;EPACC <== TREAL 

JSR STFP ;store TREAL 


; TIMAG=XIMAG( K1N2)*C- 
-~XREAL( K1N2)*S 


MOVEA.L A3,A0 ;FPWR <-- XREAL(K1N2) 
JSR GETFP 

MOVE. L CAD Al) :FPACC <-- FPWR 

MOVE. B DOR), 2081) -- 

LEA S,AO -FPWR <-- § 

JSR GETFP 


on 


JSR 
LEA 

JSR 
EORI.L 
MOVEA. L 
JSR 
MOVE. L 
MOVE. B 
LEA 

JSR 

JSR 

LEA 

JSR 

JSR 

JSR 


EORI 
MOVE. L 
LEA 
MOVE. L 
ASL 
SUBI. L 
ADDA 
MOVEA. L 
JSR 
MOVE. L 
MOVE. B 
MOVEA. G 
JSR 

JSR 

JSR 


EORI 
MOVE 


MULTEP 
TIMAG, AO 
Seer 
mask,(AQ) 
A4,A0 
Cele P 
(AZ),( Al) 
Zire ee al) 
C,AO 
GETEFP 
MULTEP 
TIMAG, AO 
GETEP 
ADDFP 

© as 


mask, TREAL 
TREAL, ( A3 ) 
XREAL, AS 
aly 18S 
#1,D3 
#2,D3 
D3,A5 
A5,A0 
GETFP 
22) 
Cee a) 
A3, AO 
GETFP 
ADDFP 

STEP 


mask, TIMAG 
TIMAG, ( A4) 


SZ 


;FPACC <\-= XREAL( K1N2)*S 
;store partial result 
;change sign of TIMAG 
;EFPWR <-- XIMAG( KINZ ) 
;FPACC <-- FPWR 

> EPWRo<==- C 

;FPACC <== XIMAG(KINZ) -e 
;FPWR <== partial TIMAG 
;FPACC <== TIMAG 

;store TIMAG 

; XREAL( KIN2 )=XREAL( K1) -TREAL 
;change sign of TREAL 
;XREAL( KINZ) <-- TREAL 
;EFPWR <-- XREAL(K1) 

>FPACC <-- FPWR 

;FPWR <-- XREAL( K1NZ2) 
;FPACC <-- XREAL( K1)-TREAL 
-store 
;XIMAG( KIN2 )=XIMAG( K1)- 


; -TIMAG 
;change sign of TIMAG 
;XIMAG( K1IN2) <-- =-TIMAG 


LEA 
ADDA. L 
MOVEA. L 
JSR 
MOVE. L 
MOVE. B 
MOVEA. L 
JSR 

JSR 

JSR 


EORI 
LEA 
JSR 
MOVE. L 
MOVE. B 
MOVEA. L 
JSR 
JSR 
JSR 


EORI 
LEA 
ISR 
MOVE. L 
MOVE. B 
MOVEA. L 
JSR 
ISR 
ISR 
ADDI. W 
SUBO. W 
BRA 


XIMAG, A6 
D3,A6 

A6, AO 
GETFP 
(AZ)7(Al) 
O( AQ, 2( Al) 
A4,A0 
GETFP 

ADDFP 

STFP 


mask, TREAL 
TREAL, AQ 
Cela e 
(A2Z),(AL) 
Zi yz) ) 
A5,A0O 
GETFP 
ADDFP 

Se 


mask, TIMAG 
TIMAG, AO 
GETFP 
(AZ),(A1) 
ZCAZ), 2 Ad ) 
A6, AQ 
CH er 
ADDFP 

Some 

#1,K 

#1,D1 
LOOP2 


og 


;A6 <--> XIMAG(K1) 

; FPWR <=-- XIMAG(K1) 

;FPACC <== FPWR 

;FPWR <== XIMAG( KINZ) 
;FPACC <== XIMAG( K1iNZ ) 
;store 

; XREAL( K1 )=XREAL( K1)+ 

; +TREAL 
;change sign of -TREAL 
;FPWR <-=- TREAL 

;FPACC <== FPWR 

; FPWR <== XREAL( K1) 
PEPAGGe<—-— final AREAL( Kl) 
; store 

; XIMAG( K1 )=XIMAG( K1)+ 

: +TIMAG 
;change sign of -TIMAG 
;FPWR <== TIMAG 

;FPACC <== FPWR 

;EPWR <== partial XIMAG(K1) 
;EPACC <-- final XIMAG(K1) 
; store 

>K=K+1] 


/ 


aa! 


100 


LOOP4 


MOVE. W 
ADD. W 
MOVE. W 
CMP. W 
BMI 
CLR. W 
SUBI.W 
ASR. W 
SUBQ. W 
BRA 
MOVE. W 
MOVE. W 
BEQ.S 
MOVE. W 
SUBI.W 
JSR 
MOVE. W 
ADDI.W 
CMP. W 
BPL 
LEA 
LEA 
MOVE. W 
ASR 
SUBI.W 


MOVEA St 
MOVEA. L 


MOVE. W 
ASR 

SUBI 
ADDA. 
ADDA. 
ADDA. 
ADDA. 
MOVE. 


eet Ft ee 


NZ bal ; K=K+N2 

K,D1 E 

alm : 

Ny Da ;IE C(K.LT. NN) Comer 102 
OZ : 

K ;K=0 

#1,NU1 ;NU1L=NU1-1 

NZ ;N2Z=N2/2 

#1,D0O i 
LOOP 1 ; 
Neo : 
$f hare)! >DO 
TOs ; 
Pay ; L=PErre eel, NU) ei 
#1, J i 

eit ; 

eee! ; 

f#1,l ; 

Gee! 7 le “Ghete. k) (CO 110s 
1003 ; 

AREAL, A3 ; TREAL=XREAL( K) 
XIMAG, A4 ; 

Daa : 

#1,D2 : 

TZ De : 

A3,A5 ; 

A4,A6 ; 

Lee : 

#1,D3 ; 

#2,D3 ; 

D1,A3 ;A3 --> XREAL(K) 
D1,A5 ;A5 --> XIMAG(K) 
D2,A4 ;A4 --> XREAL( I) 
D2,A6 ;A6 <--> XIMAG(I) 
(A3),TREAL ; 


103 K=1,N 
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1003 


HOS 


MOVE. 
MOVE. 
MOVE. 
MOVE. 
MOVE. 
ADDQ. 
SUBQ. 


BRA 
RIS 


ewes tp! it) iy ait 


(AS), TIMAG 
(A4),(A3) 
(A6),(A5) 
TREAL, (A4) 
TIMAG, (A6) 
#1,D1 
#1,D0 
LOOP4 


BS 


; TIMAG=XIMAG( K) 
» XREAL( K)=XREAL( I) 
; XIMAG( K)=XIMAG( I) 
; XREAL( 1) =TREAL 
; XIMAG( I )=TIMAG 


; RETURN 


Leo 


LOOP 


2000 


MOVEM. L 


MOVE. W 
CLR. W 
MOVE. W 
BEQ. S 
MOVE. W 
ASR. W 
MOVE. W 


ASL. W 
MOVE. W 
SUB. W 
ASL 
ADD. W 
MOVE. W 
SUB. 
BRA 


MOVEM. L 


Reo 


See END. 
PELTR” BUNGMEOr 


Bema, -( AT) 
weet 

IBIT 

NU, DO 

2000 

Jie 

#1,D1 

ie 


#1,D2 
a7 ABE 

D2,D3 

IBIT 

be, iBia 
Bind 

#1,D0 

LOOP 
(A7)+,DO-D3 
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;save registers 

;J1l=J 

; IBITR=O 

;DO 200 I=1,NU 

;J2Z=J1/2 

;D2 <== JZ 

» LE LR=TBieRe ZC Ie=Z7 ei 


ote ee 


;restore registers 
; RETURN 


SINE 


OO 


200 


300 


APPBNDIA C 
SLNE Se FUNCTION 


MOVEM. L DO-D4, -(A7) ;save registers 
MOVE. L a DO ; 

BTST. L #bit,X ;test sign of X 
BNE 100 ; 

MOVE. B #-1,SGN ;9GN <-- -l 

BCHG #bit,DO ;DO <== -DO 

BRA 2900 ; 

MOVE. B #+1,SGN ;SGN <-- 1 

MOVE. L DO. ;Y <== DO 

CMP ri YMAX, DO ;YMAX = DO 

BEL SCO ; 

SIS IONS Whe veveS 

MOVEA. L Y,AO ;AQ --> Y 

JSR GETEP ;EPWR <-- Y 

MOVE. L 7 Pee.) ;FPACC <== inverse of pi 
MOVE. B L/P ieee Al ) . 

JSR MULTEP ;ERPREGs<-- Y/PiI 
MOVEA. L /EleeO ;AQ0 <--> Y/PI 

Jr Ser ;stomesy /Pi 

MOVE. L Eel 7D ise Y /P I 

MOVE. L Di, PZ : 

ANDI. L mask,D1 ;D1 <-- mantissa 
Bast +12) toe sinsert hidden bit 
LSR eZ ;hi D2 has exponent 
SWAP D2 ;lo D2 has exponent 
SsM¥BI.B #127,D2 ;extract bias 

BEL 400 ;1£ positive go to 400 
MOVE. W +#0O,N ;clear N 

BRA 500 ; 

BNE 600 Jiao noe sO tO 600 


ey 


600 


500 


700 


MOVE. W 
BRA 
Po lisela 


ANDI 
Nera 
SWAP 
MOVE. W 
MOVE. 
BIST. B 
BEQ 


Es 


BCHG 
MOVE. L 
ANDI 
MOVEA. L 
JSR 
MOVE. L 
MOVES 
JSR 
MOVEA. L 
JSR 

JSR 
MOVEA. L 
JSR 
MOVEA. L 
JSR 
MOVE. L 
MOVE. B 
JSR 
MOVEA. L 
JSR 

JSR 
MOVEA. L 
JSR 


#1,N 
300 
DZ 


mask,Dl 
#7,D1 
Dl 

BipN 
wel AN 
#0 ,N 
goo 


#7, SGN 
X, |X| 
mask, |X] 
XN, AO 
GETEP 
-C1,(Al1) 
-C1,2(Al1) 
MULTEP 
|X|, AO 
GETEP 
ADDEP 
TEMP, AO 
ree 

AN, AQ 
GETEP 
-C2,(Al) 
=C2;,2(-AN) 
MULTEP 
TEMES 
Cir. 
ADDEP 
E,AO 
STEP 
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ee 

;shift left mantissa by 
;exponent value, max = 8 
; leave only integer part 


*mantissa in embil 


;N <== integer of mantissa 
;XN <-=- FLOAT(N) 
;N even ? 


,if Even Cdo HNoraund 
;otherwise 

;change sign of SGN 
;determine F 
-clearesign Dit 
;EPWR <-=- XN 

+S PACGwc=— oC 
s>EPACC <== =-( XN*Cl1) 
;EPWR <\-=- |X| 
;EFPACC <== |X|=-( XN*Cl1) 
;store FPACC 

; FPWR <== XN 


e 
/ 


;FPACC <-- -C2 
;FPACC <-- -(XN*C2) 
;FPWR <-- |X|-(XN*C1) 


Ee Ce-— 4 


-StCOLee. 


e 
f 


MOVE. L 
AID. is 
CMPi. i 
BMI 


MOVEA. L 
JSR 
MOVE. 
MOVE. 
JSR 


ae Gs 


MOVE. 
MOVE. 
MOVE. 
MOVE. 
JSR 
MOVEA. L 
JSR 
MOVE: & 
MOV Ee 5 
JSR 
MOVEA. L 
JSR 

JSR 
MOVE. L 
MOVE. B 
JSR 
MOVEA. L 
JSR 

JSR 
MOVE. L 
MOVE. B 
JSR 
MOVEA. L 


Oe! We 


ed 
mask, |F| 
|F|,#eps 
800 


F,AO 

GETFP 

Caz Cal) 
2(A2),2(Al) 
MULTFP 


Ae Z ) 
eye D2 ) 
R4,(Al) 
R4,2(Al) 
MULTFP 
G,AO 
STFP 
R3,(A2) 
Rem) 
ADDFP 
G,AO 
GETFP 
MULTEP 
Roale) 
Roe) 
ADDFP 
G,AO 
GETFP 
MULTFP 
Raa) 
ee 2 (2) 
ADDFP 
G,AO 
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aes 
Peledinwsagn bit 

Pie} = eps 

yomaneh tf |t | <eps 
;otherwise 


;determine R(g) 

;FPWR <-=- F 

;FPACC <-- F 

;FPACC <== F*F 

>G = FFF 

»-FPWR <-- G 

;FPACC <-- r4 

eee = LarG 

;>store G 

;FPWR <-<=- x3 

;FPACC <= r4*Gtr3 
Sikee-— CG 

;FPACC <-- (r4*Gtr3)*G 
;FPWR <-<=- r2 

;FPACC <-- (r4*Gtr3)*Gtr2 
;FPWR <-=- G 


e 
é 


-PRAGG <-- (({ )*G+r2)*G 
;FPWR <-- rl 
;FPACC <-- ( )*Gtrl 


;EPWR <-~ G 


800 
900 


DONE 


JSR 

JSR 
MOVEA. L 
JSR 

JSR 

JSR 
MOVEA. L 
Joi 

BRA 
MOVE. L 
MOVE. B 
Bib 


MOVE. L 
BCHG 
MOVE. L 
MOVEM. L 
RTS 


CELEE 
Meith P 
EF, AO 
GETFP 
Meine ee 
ADBEP 
RESULT, AO 
ee EP 

900 
Bou be 
SGN,D3 
DONE 


RESULT, D4 

#31,D4 

D4, RESULT 
(A7)+,DO-D4 
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; EP ACC ser Rig) 

;EPWR <-- F 

;EPACC <-=- F*R(g) 
;FPACC <\-=- F*R(g)t+F 
;store result 

s;result <-<- F 

;test value of SGN 

,1f PostElve do nNoOehing 
;otherwise 

;change sign of result 
;restore registers 


;Deturny co Main pwecqram 


COSINE 


100 


APPENDIX D 


COSINE FUNCTION 


MOVEM. L DO=-D4,-(A7) ;save registers 
MOVE. B #1,SGN ,eGNe<—= 1 

MOVE. L ee ; |X| <-- x 

ANDI mask, |X| ;clear sign bit 
MOVEA. L |X] ,A0 ; FPWR <-- |X| 

JSR Chine ; 

MOVE. L PI/7Z weal) FEPACC <=-= PI/z 
MOVE. B Bie 2 AL ) ' 

JOR ADDFP Ee ACE <==). | GePI/Z2 
MOVEA. L yY,A0O ;store Y 

JSR See ; 

MOVE. L Do ;DO <== Y 

CMP. L YMAX, DO ;YMAX = DO 

BPL LOC : 

error message 

MOVEA. L Y, AO ;AQ --> Y 

JSR GETEP ; FPWR <\-- Y 

MOVE. L 1/PI OAL) ;FPACC <== inverse of pil 
MOVE. B el, Ze) ; 

JSR MUBTEP pREeeeo-——— Y/Pi 
MOVEA. L Y/EL,A0 ;AO -=-> Y/PI 

Jor Sl Ee Stoney F | 

MOVE. L Y7 FP Ig jy eee en FL 

MOVE. L Diz ; 

ANDI. L mask,D1 ;D1l <-- mantissa 
Bol lol Uy IDI sinsert hidden bit 
ek + eZ shi D2 has exponent 
SWAP D2 ;lo D2 has exponent 
SUBI.B ey ey, sextract bias 

BEE ZO0 PoC Lerye GO to ZOO 


200 


400 


200 


200 


MOVE. W 
BRA 
BNE 
MOVE. W 
BRA 
Aol, 


ANDI 
ASR. B 
SWAP 
MOVE. W 
MOVE. L 
BYst.s 
BEQ 


BCHG 
MOVEA. L 
JSR 
MOVE. L 
MOVE. B 
JSR 

JSR 


MOVEA. L 
JSR 
MOVE. L 
MOVE. B 
JSR 
MOVEA. L 
JSR 

JSR 
MOVEA. L 
JSR 
MOVEA. L 
JSR 


+#0,N 
300 
400 
#1,N 
2018 
DZ. 


mask,D1l 
#72 D1 
Di 

Dia 
Ye! 
#0,N 
200 


#7, SGN 

XN, AO 
GETFP 
#-.5,(Al) 
#-.5,2(Al) 
ADDFP 

STFP 


XN, AO 
GETFP 
Ses (NY 
=Cip2ca 
MULTEP 
1X|,A0 
GETFP 
ADDFP 
TEMP , AO 
STFP 

XN, AO 
GETFP 


OZ 


sclear N 

;1f zero go to 400 

sN <-- 1 

;shift left mantissa by 
;exponent value, max = 8 

; leave only integer part 
smantissa in lo Dl 

;N <-- integer of mantissa 
;XN <-- FLOAT(N) 

;N even ? 

if CVensdo novning 
;otherwise 

;change sign of SGN 

;FPWR <-- XN 

*>FPACC <== .5 

sFPACC <=--XN-. 5 

;store XN 

;determine F 

;FPWR <-- XN 

*FPACC <-= Cl 


° 
/ 


-FPACGM==— -( XN*Cam 
-FPWR <-- [X|] 
-SeRCCr ==) lap e 1) 


“Store SeAcc 
>FPWR <== XN 


/ 


MOVE. 


L 


MOVE. B 


JSR 


MOVEA. L 


JSR 
JSR 


MOVEA. L 


JSR 


MOVE. 
AND ie 
CHE tf. 


BMI 


MOVEA. L 


JSR 


MOVE. 
MOVE. 


JSR 


MOVE. 
MOVE. 
MOVE. 
MOVE. 


JSR 


Die) Fee 


JSR 


MOVE. 
MOVE. 


JSR 


MOVEA. L 


JSR 
JSR 


MOVE. 
MOVE. 


L 
L 
L 


L 
B 


ODeoew 


L 
B 


L 
B 


=e, (AL) 
eZ ee) 
MULTFP 
TEMP, AO 
CEIEE 
ADDEP 
E,AO 
SLEEP 
Be 
mask, |E| 
|F|,#eps 
600 


F,AO 
GETFP 
(A2),(A1) 
Qi AZ AL) 
MULTFP 


eel) eae 
DCm fe 
R4,(Al) 
R4,2(Al) 
MULTFP 
G,AO 

STFP 
R3,(A2) 
Romenes 
ADDFP 
G,AO 
GETFP 
MULTEP 
eee) 
RAO) 


Shs 


Teplice = C2 
ae eeme-— —( SNe?) 
Show <== |< l=@Ne cl ) 


pHEACCG <-- F 
»store F 
ee 
yomenr Signe bit 
eta 
poOLaned 16 ~| t,| 


; otherwise 


eps 


< eps 


;determine R(g) 
;EPWR <-- EF 
;FPACC <-- F 
;FPACC <-- 
;G = F*F 

; FPWR <-- G 


Bon 


;FPACC <== r4 
fe eee, <== 


;store G 


r4*G 


; FPWR <-- x3 
Ponce. <== 
peak —. GC 


mone r eS 


>FPACC <-- (r4*G+r3)*G 
;FPWR <-- r2 


* 
/ 


600 
700 


DONE 


JSR ADDFP 
MOVEA. L G,AO 


JSR GETFP 
JSR MULTEP 
MOVE. L RICA? ) 
MOVE. B Ramen) 
JSR ADDFP 
MOVEA.L G,AO 

JSR GETFP 

JSR MULTEP 
MOVEA.L F,AO 

TSR GETFP 

JSR MULTEP 
JSR ADDFP 
MOVEA.L RESULT, AO 
JSR STFP 

BRA 700 

MOVE. L F, RESULT 
MOVE. B SGN,D3 
BPL DONE 

MOVE. L RESULT, D4 
BCHG #31,D4 
MOVE. L D4, RESULT 
MOVEM. L (A7)+,DO-D4 
RTS 
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> FPACGe<==— ( reé-Caes Cire 
; FPWR <-- G 

,PPACG Ra —— (1 Vere 2 kG 
;FPWR <-= rl 

 PPACC <== i V*Ger lL 

; FPWR <-- G 

;FPACC <-- R(g) 

; FPWR <-- F 

> BPACC === F*R( Gg) 

7 HEACE. <== F-Ricg) + 
;store result 

;result <== F 

;test value of SGN 

LEV Desilueime, do Noeniag 
;otherwise 

Chengde = sino: saeco le 
;restore registers 


;return to main program 
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